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Abstract 



Most nervous systems encode information about stimuli in the respond- 
ing activity of large neuronal networks. This activity often manifests 
itself as dynamically coordinated sequences of action potentials. Since 
multiple electrode recordings are now a standard tool in neuroscience 
research, it is important to have a measure of such network-wide behav- 
ioral coordination and information sharing, applicable to multiple neural 
spike train data. We propose a new statistic, informational coherence, 
which measures how much better one unit can be predicted by knowing 
the dynamical state of another. We argue informational coherence is a 
measure of association and shared information which is superior to tradi- 
tional pairwise measures of synchronization and correlation. To find the 
dynamical states, we use a recently-introduced algorithm which recon- 
structs effective state spaces from stochastic time series. We then extend 
the pairwise measure to a multivariate analysis of the network by estimat- 
ing the network multi-information. We illustrate our method by testing it 
on a detailed model of the transition from gamma to beta rhythms. 

Much of the most important information in neural systems is shared over multiple neu- 
rons or cortical areas, in such forms as population codes and distributed representations 
(Q. On behavioral time scales, neural information is stored in temporal patterns of ac- 
tivity as opposed to static markers; therefore, as information is shared between neurons 
or brain regions, it is physically instantiated as coordination between entire sequences of 
neural spikes. Furthermore, neural systems and regions of the brain often require coor- 
dinated neural activity to perform important functions; acting in concert requires multiple 
neurons or cortical areas to share information 1 2 1 . Thus, if we want to measure the dynamic 
network-wide behavior of neurons and test hypotheses about them, we need reliable, practi- 
cal methods to detect and quantify behavioral coordination and the associated information 
sharing across multiple neural units. These would be especially useful in testing ideas 
about how particular forms of coordination relate to distributed coding (e.g., that of |3|). 
Current techniques to analyze relations among spike trains handle only pairs of neurons, so 
we further need a method which is extendible to analyze the coordination in the network, 
system, or region as a whole. Here we propose a new measure of behavioral coordination 
and information sharing, informational coherence, based on the notion of dynamical state. 

Section^argues that coordinated behavior in neural systems is often not captured by exist- 



ing measures of synchronization or correlation, and that something sensitive to nonlinear, 
stochastic, predictive relationships is needed. Section |2 defines informational coherence 
as the (normalized) mutual information between the dynamical states of two systems and 
explains how looking at the states, rather than just observables, fulfills the needs laid out in 
Section^ Since we rarely know the right states a prori, Section l2~Tl brieflv describes how 
we reconstruct effective state spaces from data. Section l2~2l gives some details about how 
we calculate the informational coherence and approximate the global information stored 
in the network. Section [3] applies our method to a model system (a biophysically detailed 
conductance-based model) comparing our results to those of more familiar second-order 
statistics. In the interest of space, we omit proofs and a full discussion of the existing lit- 
erature, giving only minimal references here; proofs and references will appear in a longer 
paper now in preparation. 

1 Synchrony or Coherence? 

Most hypotheses which involve the idea that information sharing is reflected in coordinated 
activity across neural units invoke a very specific notion of coordinated activity, namely 
strict synchrony: the units should be doing exactly the same thing (e.g., spiking) at exactly 
the same time. Investigators then measure coordination by measuring how close the units 
come to being strictly synchronized (e.g., variance in spike times). 

From an informational point of view, there is no reason to favor strict synchrony over 
other kinds of coordination. One neuron consistently spiking 50 ms after another is just as 
informative a relationship as two simultaneously spiking, but such stable phase relations 
are missed by strict-synchrony approaches. Indeed, whatever the exact nature of the neural 
code, it uses temporally extended patterns of activity, and so information sharing should be 
reflected in coordination of those patterns, rather than just the instantaneous activity. 

There are three common ways of going beyond strict synchrony: cross-correlation and 
related second-order statistics, mutual information, and topological generalized synchrony. 

The cross-correlation function (the normalized covariance function; this includes, for 
present purposes, the joint peristimulus time histogram 1 2 1), is one of the most widespread 
measures of synchronization. It can be efficiently calculated from observable series; it 
handles statistical as well as deterministic relationships between processes; by incorporat- 
ing variable lags, it reduces the problem of phase locking. Fourier transformation of the 
covariance function jxY(h) yields the cross-spectrum Fxy{v), which in turn gives the 
spectral coherence cxy(v) = F XY {v) / Fx{v)Fy{v), a normalized correlation between 
the Fourier components of X and Y. Integrated over frequencies, the spectral coherence 
measures, essentially, the degree of linear cross-predictability of the two series. (|4| applies 
spectral coherence to coordinated neural activity.) However, such second-order statistics 
only handle linear relationships. Since neural processes are known to be strongly non- 
linear, there is little reason to think these statistics adequately measure coordination and 
synchrony in neural systems. 

Mutual information is attractive because it handles both nonlinear and stochastic relation- 
ships and has a very natural and appealing interpretation. Unfortunately, it often seems 
to fail in practice, being disappointingly small even between signals which are known to 
be tightly coupled |5|. The major reason is that the neural codes use distinct patterns of 
activity over time, rather than many different instantaneous actions, and the usual approach 
misses these extended patterns. Consider two neurons, one of which drives the other to 
spike 50 ms after it does, the driving neuron spiking once every 500 ms. These are very 
tightly coordinated, but whether the first neuron spiked at time t conveys little information 
about what the second neuron is doing at t — it's not spiking, but it's not spiking most of 
the time anyway. Mutual information calculated from the direct observations conflates the 



"no spike" of the second neuron preparing to fire with its just-sitting-around "no spike". 
Here, mutual information could find the coordination if we used a 50 ms lag, but that won't 
work in general. Take two rate-coding neurons with base-line firing rates of 1 Hz, and sup- 
pose that a stimulus excites one to 10 Hz and suppresses the other to 0.1 Hz. The spiking 
rates thus share a lot of information, but whether the one neuron spiked at t is uninformative 
about what the other neuron did then, and lagging won't help. 

Generalized synchrony is based on the idea of establishing relationships between the states 
of the various units. "State" here is taken in the sense of physics, dynamics and control 
theory: the state at time t is a variable which fixes the distribution of observables at all 
times > t, rendering the past of the system irrelevant |6|. Knowing the state allows us to 
predict, as well as possible, how the system will evolve, and how it will respond to external 
forces [7 1. Two coupled systems are said to exhibit generalized synchrony if the state of one 
system is given by a mapping from the state of the other. Applications to data employ state- 
space reconstruction |8|: if the state x G X evolves according to smooth, <i-dimensional 
deterministic dynamics, and we observe a generic function y = f(x), then the space y 
of time-delay vectors [y(t),y(t — r), ...y(t — (fc — l)r)] is diffeomorphic to X if k > 2d, 
for generic choices of lag r. The various versions of generalized synchrony differ on how, 
precisely, to quantify the mappings between reconstructed state spaces, but they all appear 
to be empirically equivalent to one another and to notions of phase synchronization based 
on Hilbert transforms 1 5 1 . Thus all of these measures accommodate nonlinear relationships, 
and are potentially very flexible. Unfortunately, there is essentially no reason to believe that 
neural systems have deterministic dynamics at experimentally-accessible levels of detail, 
much less that there are deterministic relationships among such states for different units. 

What we want, then, but none of these alternatives provides, is a quantity which measures 
predictive relationships among states, but allows those relationships to be nonlinear and 
stochastic. The next section introduces just such a measure, which we call "informational 
coherence". 



2 States and Informational Coherence 

There are alternatives to calculating the "surface" mutual information between the se- 
quences of observations themselves (which, as described, fails to capture coordination). 
If we know that the units are phase oscillators, or rate coders, we can estimate their instan- 
taneous phase or rate and, by calculating the mutual information between those variables, 
see how coordinated the units' patterns of activity are. However, phases and rates do not 
exhaust the repertoire of neural patterns and a more general, common scheme is desirable. 
The most general notion of "pattern of activity" is simply that of the dynamical state of the 
system, in the sense mentioned above. We now formalize this. 

Assuming the usual notation for Shannon information |9|, the information content of a 
state variable X is H[X] and the mutual information between X and Y is I[X; Y]. As 
is well-known, I[X;Y] < mm H[X],H[Y]. We use this to normalize the mutual state 
information to a — 1 scale, and this is the informational coherence (IC). 

HX,Y) = with 0/0 = 0. (1) 

mmH [X J, H [Y\ 

ijj can be interpreted as follows. I[X; Y] is the Kullback-Leibler divergence between the 
joint distribution of X and Y, and the product of their marginal distributions 1 9 1, indicating 
the error involved in ignoring the dependence between X and Y. The mutual information 
between predictive, dynamical states thus gauges the error involved in assuming the two 
systems are independent, i.e., how much predictions could improve by taking into account 
the dependence. Hence it measures the amount of dynamically-relevant information shared 



between the two systems, ip simply normalizes this value, and indicates the degree to 
which two systems have coordinated patterns of behavior (cf. 1 10 1, although this only uses 
directly observable quantities). 

2.1 Reconstruction and Estimation of Effective State Spaces 

As mentioned, the state space of a deterministic dynamical system can be reconstructed 
from a sequence of observations. This is the main tool of experimental nonlinear dynamics 
1 8 1; but the assumption of determinism is crucial and false, for almost any interesting neural 
system. While classical state-space reconstruction won't work on stochastic processes, 
such processes do have state-space representations 1 1 1 1, and, in the special case of discrete- 
valued, discrete-time series, there are ways to reconstruct the state space. 

Here we use the CSSR algorithm, introduced in 1121 (code available at 
http : / / bact ra.org/CSSR). This produces causal state models, which are stochastic 
automata capable of statistically-optimal nonlinear prediction; the state of the machine 
is a minimal sufficient statistic for the future of the observable process ! 1 31 . 1 The basic 
idea is to form a set of states which should be (1) Markovian, (2) sufficient statistics for 
the next observable, and (3) have deterministic transitions (in the automata-theory sense). 
The algorithm begins with a minimal, one-state, IID model, and checks whether these 
properties hold, by means of hypothesis tests. If they fail, the model is modified, generally 
but not always by adding more states, and the new model is checked again. Each state 
of the model corresponds to a distinct distribution over future events, i.e., to a statistical 
pattern of behavior. Under mild conditions, which do not involve prior knowledge of 
the state space, CSSR converges in probability to the unique causal state model of the 
data-generating process 1121 . In practice, CSSR is quite fast (linear in the data size), and 
generalizes at least as well as training hidden Markov models with the EM algorithm and 
using cross-validation for selection, the standard heuristic l fl2l . 

One advantage of the causal state approach (which it shares with classical state-space re- 
construction) is that state estimation is greatly simplified. In the general case of nonlinear 
state estimation, it is necessary to know not just the form of the stochastic dynamics in 
the state space and the observation function, but also their precise parametric values and 
the distribution of observation and driving noises. Estimating the state from the observable 
time series then becomes a computationally-intensive application of Bayes's Rule 1171 . 

Due to the way causal states are built as statistics of the data, with probability 1 there is a 
finite time, t, at which the causal state at time t is certain. This is not just with some degree 
of belief or confidence: because of the way the states are constructed, it is impossible for the 
process to be in any other state at that time. Once the causal state has been established, it can 
be updated recursively, i.e., the causal state at time t + 1 is an explicit function of the causal 
state at time t and the observation at t + 1. The causal state model can be automatically 
converted, therefore, into a finite-state transducer which reads in an observation time series 
and outputs the corresponding series of states 1181 1131 . (Our implementation of CSSR 
filters its training data automatically.) The result is a new time series of states, from which 
all non-predictive components have been filtered out. 

2.2 Estimating the Coherence 

Our algorithm for estimating the matrix of informational coherences is as follows. For each 
unit, we reconstruct the causal state model, and filter the observable time series to produce 
a series of causal states. Then, for each pair of neurons, we construct a joint histogram of 

'Causal state models have the same expressive power as observable operator models 1 14 1 or pre- 
dictive state representations |7|, and greater power than variable-length Markov models I15II16I . 
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Figure 1 : Rastergrams of neuronal spike-times in the network. Excitatory, pyramidal neurons (num- 
bers 1 to 1000) are shown in green, inhibitory intemeurons (numbers 1001 to 1300) in red. During the 
first 10 seconds (a), the current connections among the pyramidal cells are suppressed and a gamma 
rhythm emerges (left). At t = 10s, those connections become active, leading to a beta rhythm (b, 
right). 



the state distribution, estimate the mutual information between the states, and normalize by 
the single-unit state informations. This gives a symmetric matrix of ip values. 

Even if two systems are independent, their estimated IC will, on average, be positive, be- 
cause, while they should have zero mutual information, the empirical estimate of mutual 
information is non-negative. Thus, the significance of IC values must be assessed against 
the null hypothesis of system independence. The easiest way to do so is to take the re- 
constructed state models for the two systems and run them forward, independently of one 
another, to generate a large number of simulated state sequences; from these calculate val- 
ues of the IC. This procedure will approximate the sampling distribution of the IC under a 
null model which preserves the dynamics of each system, but not their interaction. We can 
then find p-values as usual. We omit them here to save space. 

2.3 Approximating the Network Multi-Information 

There is broad agreement [2| that analyses of networks should not just be an analysis of 
pairs of neurons, averaged over pairs. Ideally, an analysis of information sharing in a net- 
work would look at the over-all structure of statistical dependence between the various 
units, reflected in the complete joint probability distribution P of the states. This would 
then allow us, for instance, to calculate the n-fold multi-information, I[Xx, X 2 , ■ ■ ■ X n ] 
1)1' ( ) ., the Kullback-Leibler divergence between the joint distribution P and the prod- 
uct of marginal distributions Q, analogous to the pairwise mutual information 1191 . Cal- 
culated over the predictive states, the multi-information would give the total amount of 
shared dynamical information in the system. Just as we normalized the mutual information 
I[Xi,X2] by its maximum possible value, miniI[Xi], H\X 2 \, we normalize the multi- 
information by its maximum, which is the smallest sum of n — 1 marginal entropies: 

I[X V ,X 2 ; . . . X n ] < mmY,H[X n ] 

Unfortunately, P is a distribution over a very high dimensional space and so, hard to esti- 
mate well without strong parametric constraints. We thus consider approximations. 

The lowest-order approximation treats all the units as independent; this is the distribution 
Q. One step up are tree distributions, where the global distribution is a function of the joint 
distributions of pairs of units. Not every pair of units needs to enter into such a distribution, 



though every unit must be part of some pair. Graphically, a tree distribution corresponds to a 
spanning tree, with edges linking units whose interactions enter into the global probability, 
and conversely spanning trees determine tree distributions. Writing Et for the set of pairs 
(i, j) and abbreviating X\ = Xi, X2 = ■ ■ ■ X n — x n by X = x, one has 

^""JL ^-i' B^^ <2) 

where the marginal distributions T(Xi) and the pair distributions T(Xi, Xj) are estimated 
by the empirical marginal and pair distributions. 

We must now pick edges Et so that T best approximates the true global distribution P. 
A natural approach is to minimize D(P\\T), the divergence between P and its tree ap- 
proximation. Chow and Liu 1 20 1 showed that the maximum-weight spanning tree gives the 
divergence-minimizing distribution, taking an edge's weight to be the mutual information 
between the variables it links. 

There are three advantages to using the Chow-Liu approximation. (1) Estimating T from 
empirical probabilities gives a consistent maximum likelihood estimator of the ideal Chow- 
Liu tree [20|, with reasonable rates of convergence, so T can be reliably known even if 
P cannot. (2) There are efficient algorithms for constructing maximum-weight spanning 
trees, such as Prim's algorithm 1211 sec. 23.2], which runs in time 0(n 2 + n log n). Thus, 
the approximation is computationally tractable. (3) The KL divergence of the Chow-Liu 
distribution from Q gives a lower bound on the network multi-information; that bound is 
just the sum of the mutual informations along the edges in the tree: 

I[X 1 ;X 2 ;...X n ]>D(T\\Q)= £ X,-] (3) 

(hj)eE T 

Even if we knew P exactly, Eq.[3]would be useful as an alternative to calculating D(P\\Q) 
directly, evaluating logP(x)/Q(x) for all the exponentially-many configurations x. 

It is natural to seek higher-order approximations to P, e.g., using three-way interactions 
not decomposable into pairwise interactions 1221 1 191 . But it is hard to do so effectively, 
because finding the optimal approximation to P when such interactions are allowed is NP 
1231 . and analytical formulas like Eq.[3] generally do not exist 1191 . We therefore confine 
ourselves to the Chow-Liu approximation here. 



3 Example: A Model of Gamma and Beta Rhythms 

We use simulated data as a test case, instead of empirical multiple electrode recordings, 
which allows us to try the method on a system of over 1000 neurons and compare the 
measure against expected results. The model, taken from 1241 . was originally designed to 
study episodes of gamma (30-80Hz) and beta (12-30Hz) oscillations in the mammalian 
nervous system, which often occur successively with a spontaneous transition between 
them. More concretely, the rhythms studied were those displayed by in vitro hippocampal 
(CA1) slice preparations and by in vivo neocortical EEGs. 

The model contains two neuron populations: excitatory (AMPA) pyramidal neurons and 
inhibitory (GABA^) interneurons, defined by conductance-based Hodgkin-Huxley-style 
equations. Simulations were carried out in a network of 1000 pyramidal cells and 300 
interneurons. Each cell was modeled as a one-compartment neuron with all-to-all coupling, 
endowed with the basic sodium and potassium spiking currents, an external applied current, 
and some Gaussian input noise. 

The first 10 seconds of the simulation correspond to the gamma rhythm, in which only a 
group of neurons is made to spike via a linearly increasing applied current. The beta rhythm 
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Figure 2: Heat-maps of coordination for the network, as measured by zero-lag cross-correlation 
(top row) and informational coherence (bottom), contrasting the gamma rhythm (left column) with 
the beta (right). Colors run from red (no coordination) through yellow to pale cream (maximum). 



(subsequent 10 seconds) is obtained by activating pyramidal-pyramidal recurrent connec- 
tions (potentiated by Hebbian preprocessing as a result of synchrony during the gamma 
rhythm) and a slow outward after-hyper-polarization (AHP) current (the M-current), sup- 
pressed during gamma due to the metabotropic activation used in the generation of the 
rhythm. During the beta rhythm, pyramidal cells, silent during gamma rhythm, fire on a 
subset of interneurons cycles (Fig.Q. 

Fig. |2 compares zero-lag cross-correlation, a second-order method of quantifying coor- 
dination, with the informational coherence calculated from the reconstructed states. (In 
this simulation, we could have calculated the actual states of the model neurons directly, 
rather than reconstructing them, but for purposes of testing our method we did not.) Cross- 
correlation finds some of the relationships visible in Fig.^ but is confused by, for instance, 
the phase shifts between pyramidal cells. (Surface mutual information, not shown, gives 
similar results.) Informational coherence, however, has no trouble recognizing the two pop- 
ulations as effectively coordinated blocks. The presence of dynamical noise, problematic 
for ordinary state reconstruction, is not an issue. The average IC is 0.411 (or 0.797 if the 
inactive, low-numbered neurons are excluded). The tree estimate of the global informa- 
tional multi-information is 3243.7 bits, with a global coherence of 0.777. The right half of 



Fig- Erepeats this analysis for the beta rhythm; in this stage, the average IC is 0.614, and 
the tree estimate of the global multi-information is 7377.7 bits, though the estimated global 
coherence falls very slightly to 0.742. This is because low-numbered neurons which were 
quiescent before are now active, contributing to the global information, but the over-all 
pattern is somewhat weaker and more noisy (as can be seen from Fig.^j.) So, as expected, 
the total information content is higher, but the overall coordination across the network is 
lower. 

4 Conclusion 

Informational coherence provides a measure of neural information sharing and coordinated 
activity which accommodates nonlinear, stochastic relationships between extended patterns 
of spiking. It is robust to dynamical noise and leads to a genuinely multivariate measure 
of global coordination across networks or regions. Applied to data from multi-electrode 
recordings, it should be a valuable tool in evaluating hypotheses about distributed neural 
representation and function. 
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